32 research outputs found
Streaming egocentric action anticipation: An evaluation scheme and approach
Egocentric action anticipation aims to predict the future actions the camera
wearer will perform from the observation of the past. While predictions about
the future should be available before the predicted events take place, most
approaches do not pay attention to the computational time required to make such
predictions. As a result, current evaluation schemes assume that predictions
are available right after the input video is observed, i.e., presuming a
negligible runtime, which may lead to overly optimistic evaluations. We propose
a streaming egocentric action evaluation scheme which assumes that predictions
are performed online and made available only after the model has processed the
current input segment, which depends on its runtime. To evaluate all models
over the same prediction horizon, we therefore propose that slower models base
their predictions on temporal segments sampled further ahead of time. Based
on the observation that model runtime can affect performance in the considered
streaming evaluation scenario, we further propose a lightweight action
anticipation model based on feed-forward 3D CNNs which is optimized using
knowledge distillation techniques with a novel past-to-future distillation
loss. Experiments on the three popular datasets EPIC-KITCHENS-55,
EPIC-KITCHENS-100 and EGTEA Gaze+ show that (i) the proposed evaluation scheme
induces a different ranking among state-of-the-art methods compared to classic
evaluations, (ii) lightweight approaches tend to outperform more computationally
expensive ones, and (iii) the proposed model based on feed-forward 3D CNNs and
knowledge distillation outperforms the current state of the art in the streaming
egocentric action anticipation scenario.
Comment: Published in Computer Vision and Image Understanding, 2023. arXiv admin note: text overlap with arXiv:2110.0538
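The timing constraint behind the streaming evaluation scheme can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name and the simplified timing model (a single fixed runtime per prediction) are assumptions made for clarity.

```python
def observed_segment_end(prediction_time: float, runtime: float) -> float:
    """Latest time the observed video segment may end so that a model with
    the given runtime has its prediction ready at `prediction_time`.

    Classic evaluation assumes a negligible runtime (runtime = 0), so the
    segment may end exactly at `prediction_time`; under the streaming
    scheme a slower model must base its prediction on a segment sampled
    ahead of time.
    """
    return prediction_time - runtime

# A fast model (0.1 s runtime) observes video almost up to the prediction
# time; a slow model (1.5 s runtime) must stop observing 1.5 s earlier,
# which can hurt accuracy despite a stronger architecture.
fast_end = observed_segment_end(prediction_time=10.0, runtime=0.1)
slow_end = observed_segment_end(prediction_time=10.0, runtime=1.5)
```

This makes explicit why, under the streaming scheme, runtime itself becomes part of the evaluation: two models with identical classic accuracy can rank differently once the slower one is forced to observe older video.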
StillFast: An End-to-End Approach for Short-Term Object Interaction Anticipation
The anticipation problem has been studied from different perspectives, such as
predicting humans' locations, predicting hand and object trajectories, and
forecasting actions and human-object interactions. In this paper, we study the
short-term object interaction anticipation problem from the egocentric point of
view, proposing a new end-to-end architecture named StillFast. Our approach
simultaneously processes a still image and a video, detecting and localizing
next-active objects, predicting the verb that describes the future interaction,
and determining when the interaction will start. Experiments on the large-scale
egocentric dataset EGO4D show that our method outperforms state-of-the-art
approaches on the considered task. Our method ranked first on the public
leaderboard of the EGO4D short-term object interaction anticipation challenge
2022. Please see the project web page for code and
additional details: https://iplab.dmi.unict.it/stillfast/
MECCANO: A Multimodal Egocentric Dataset for Humans Behavior Understanding in the Industrial-like Domain
Wearable cameras allow images and videos to be acquired from the user's
perspective. These data can be processed to understand human behavior. Although
human behavior analysis has been thoroughly investigated in third-person
vision, it is still understudied in egocentric settings, and in particular in
industrial scenarios. To encourage research in this field, we present MECCANO,
a multimodal dataset of egocentric videos for studying human behavior
understanding in industrial-like settings. The multimodality is characterized
by the presence of gaze signals, depth maps and RGB videos acquired
simultaneously with a custom headset. The dataset has been explicitly labeled
for fundamental tasks in the context of human behavior understanding from a
first-person view, such as recognizing and anticipating human-object
interactions. With the MECCANO dataset, we explore five different tasks:
1) Action Recognition, 2) Active Objects Detection and Recognition,
3) Egocentric Human-Objects Interaction Detection, 4) Action Anticipation and
5) Next-Active Objects Detection. We propose a benchmark aimed at studying
human behavior in the considered industrial-like scenario, which demonstrates
that the investigated tasks and scenario are challenging for state-of-the-art
algorithms. To support research in this field, we publicly release the dataset
at https://iplab.dmi.unict.it/MECCANO/.
Comment: arXiv admin note: text overlap with arXiv:2010.0565
Is First Person Vision Challenging for Object Tracking?
Understanding human-object interactions is fundamental in First Person Vision
(FPV). Tracking algorithms which follow the objects manipulated by the camera
wearer can provide useful cues to effectively model such interactions. Visual
tracking solutions available in the computer vision literature have
significantly improved their performance in recent years across a large variety
of target objects and tracking scenarios. However, despite a few previous
attempts to exploit trackers in FPV applications, a methodical analysis of the
performance of state-of-the-art trackers in this domain is still missing. In
this paper, we fill this gap by presenting the first systematic study of object
tracking in FPV. Our study extensively analyses the performance of recent
visual trackers and baseline FPV trackers with respect to different aspects and
considering a new performance measure. This is achieved through TREK-150, a
novel benchmark dataset composed of 150 densely annotated video sequences. Our
results show that object tracking in FPV is challenging, suggesting that more
research effort should be devoted to this problem so that tracking can better
benefit FPV tasks.
Comment: IEEE/CVF International Conference on Computer Vision (ICCV) 2021, Visual Object Tracking Challenge VOT2021 workshop. arXiv admin note: text overlap with arXiv:2011.1226